Web traffic is highly skewed，我们可以通过缓存提高 performance。缓存内容可以是 query, result page, inverted list。

Caching of Popular Results

Query distribution

query rank 和 frequency 符合长尾分布。Top 25 queries 占了 1% 的流量。在 distinct queries 里，

64% occur once
16% occur twice
7% occur three times
14% occur >=3 times
average query frequency: 4

RAM & DISK

给 query cache 分配 RAM
储存标准 queries，按字母顺序排列 term
1.6GB cache 储存 40 million queries (40 bytes/query)
给 result page cache 分配磁盘
一页 30KB uncompressed, 10KB compressed
400GB cache 可以存 40 million result pages
Cache misses use RAM only(very fast)
Cache hits use RAM+disk
比正常 evaluate query 要快
只用一台机器

RAM only

给 query cache 分配 RAM
储存标准 queries，按字母顺序排列 term
9MB cache 储存 300,000 queries
给 result page cache 分配 RAM
一页 30KB uncompressed, 10KB compressed
2.1GB cache 存 210,000 compressed result pages
Cache misses and hits only use RAM (very fast)
因为缓存的比较少，所以更多的 query 会被 miss
Could partition caches across multiple machines
需要更复杂的 design

Cache size?

Markatos 提出，30% 的 queries 会与缓存里的 query 匹配，然而增加 cache size 只能非常小幅度的提高 hit rate。见下图：

根据 UK2007 的 query log，44% 的 query 只出现了一次，56% 的 query 出现了超过一次，cache 这 56% 里的 query 有助于提高 performance，然而并不能帮助 first occurrence of a query。

Caching Inverted List

根据 UK2007 的 query log，4% 的 query term 只出现了一次，96% 的 query term 出现不止一次，对 96% 里对 query term 进行 inverted list 的 cache 才是有用的。在每个 partition 上分配一部分 RAM (a few GB)给 inverted list。这里的重点在于 which terms should be cached? 两个原则：

Terms that are frequent in a query log (improve the hit rage)
Terms that don’t have massive inverted lists (consume limited cache space)

对 term 的排序： $$Score(t)={qtf(t) \over df(t)}$$。